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Abstract 

We present a method for learning max- weight matching predictors in bipartite 
graphs. The method consists of performing maximum a posteriori estimation in 
exponential families with sufficient statistics that encode permutations and data 
features. Although inference is in general hard, we show that for one very rele- 
vant application-web page ranking-exact inference is efficient. For general model 
instances, an appropriate sampler is readily available. Contrary to existing max- 
margin matching models, our approach is statistically consistent and, in addition, 
experiments with increasing sample sizes indicate superior improvement over such 
models. We apply the method to graph matching in computer vision as well as to 
a standard benchmark dataset for learning web page ranking, in which we obtain 
state-of-the-art results, in particular improving on max-margin variants. The draw- 
back of this method with respect to max-margin alternatives is its runtime for large 
graphs, which is comparatively high. 

1 Introduction 

The Maximum- Weight Bipartite Matching Problem (henceforth 'matching problem') 
is a fundamental problem in combinatorial optimization [ ]. This is the problem of 
finding the 'heaviest' perfect match in a weighted bipartite graph. An exact optimal 
solution can be found in cubic time by standard methods such as the Hungarian algo- 
rithm. 

This problem is of practical interest because it can nicely model real-world applica- 
tions. For example, in computer vision the crucial problem of finding a correspondence 
between sets of image features is often modeled as a matching problem [2, 3]. Ranking 
algorithms can be based on a matching framework [19], as can clustering algorithms 

[14,11]. 

When modeling a problem as one of matching, one central question is the choice of 
the weight matrix. The problem is that in real applications we typically observe edge 
feature vectors, not edge weights. Consider a concrete example in computer vision: it 
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is difficult to tell what the 'similarity score' is between two image feature points, but it 
is straightforward to extract feature vectors (e.g. SIFT) associated with those points. 

In this setting, it is natural to ask whether we could parameterize the features, and 
use labeled matches in order to estimate the parameters such that, given graphs with 
'similar' features, their resulting max- weight matches are also 'similar'. This idea 
of 'parameterizing algorithms' and then optimizing for agreement with data is called 
structured estimation [31, 33]. 

[31] and [3] describe max-margin structured estimation formalisms for this prob- 
lem. Max-margin structured estimators are appealing in that they try to minimize the 
loss that one really cares about ('structured losses', of which the Hamming loss is an 
example). However structured losses are typically piecewise constant in the parame- 
ters, which eliminates any hope of using smooth optimization directly. Max-margin 
estimators instead minimize a surrogate loss which is easier to optimize, namely a con- 
vex upper bound on the structured loss [33]. In practice the results are often good, 
but known convex relaxations produce estimators which are statistically inconsistent 
[22], i.e., the algorithm in general fails to obtain the best attainable model in the limit 
of infinite training data. The inconsistency of multiclass support vector machines is a 
well-known issue in the literature that has received careful examination recently [8, 7]. 

Motivated by the inconsistency issues of max-margin structured estimators as well 
as by the well-known benefits of having a full probabilistic model, in this paper we 
present a maximum a posteriori (MAP) estimator for the matching problem. The ob- 
served data are the edge feature vectors and the labeled matches provided for training. 
We then maximize the conditional posterior likelihood of matches given the observed 
data. We build an exponential family model where the sufficient statistics are such that 
the mode of the distribution (the prediction) is the solution of a max- weight matching 
problem. The resulting partition function is t(P-complete to compute exactly. However, 
we show that for learning to rank applications the model instance is tractable. We then 
compare the performance of our model instance against a large number of state-of-the- 
art ranking methods, including DORM [ ], an approach that only differs to our model 
instance by using max-margin instead of a MAP formulation. We show very compet- 
itive results on standard webpage ranking datasets, and in particular we show that our 
model performs better than or on par with DORM. For intractable model instances, we 
show that the problem can be approximately solved using sampling and we provide 
experiments from the computer vision domain. However the fastest suitable sampler is 
still quite slow for large models, in which case max-margin matching estimators like 
those of [3] and [31] are likely to be preferable even in spite of their potential inferior 
accuracy. 
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2 Background 



2.1 Structured Prediction 

In recent years, great attention has been devoted in Machine Learning to so-called 
structured predictors, which are predictors of the kind 

ge : X i-> y, (1) 

where X is an arbitrary input space and y is an arbitrary discrete space, typically expo- 
nentially large, y may be, for example, a space of matrices, trees, graphs, sequences, 
strings, matches, etc. This structured nature of y is what structured prediction refers to. 
In the setting of this paper, X is the set of vector- weighted bipartite graphs (i.e., each 
edge has a feature vector associated to it), and y is the set of perfect matches induced 
by X. If N graphs are available, along with corresponding annotated matches (i.e., a set 
{(x n , y n )}^ =1 ), our task will be to estimate 6 such that when we apply the predictor 
ge to a new graph it produces a match that is similar to matches of similar graphs from 
the annotated set. Structured learning or structured estimation refers to the process of 
estimating a vector 6 for predictor ge when data {(x 1 , y 1 ), . . . , (x N , y N )} G (X x y) N 
are available. Structured prediction for input x means computing y — g(x;0) using the 
estimated 6. 

Two generic estimation strategies have been popular in producing structured predic- 
tors. One is based on max-margin estimators [33, 32, 31], and the other on maximum- 
likelihood (ML) or MAP estimators in exponential family models [18]. 

The first approach is a generalization of support vector machines to the case where 
the set y is structured. However the resulting estimators are known to be inconsistent 
in general: in the limit of infinite training data the algorithm fails to recover the best 
model in the model class [22, 7, 8]. McAllester recently provided an interesting anal- 
ysis on this issue, where he proposed new upper bounds whose minimization results 
in consistent estimators, but no such bounds are convex [ ]. The other approach uses 
ML or MAP estimation in conditional exponential families with 'structured' sufficient 
statistics, such as in probabilistic graphical models, where they are decomposed over 
the cliques of the graph (in which case they are called Conditional Random Fields, or 
CRFs [ ]). In the case of tractable graphical models, dynamic programming can be 
used to efficiently perform inference. Other tractable models of this type include mod- 
els that predict spanning trees and models that predict binary labelings in planar graphs 
[9, 17]. ML and MAP estimators in exponential families not only amount to solving 
an unconstrained and convex optimization problem; in addition they are statistically 
consistent. The main problem with these types of models is that often the partition 
function is intractable. This has motivated the use of max-margin methods in many 
scenarios where such intractability arises. 

2.2 The Matching Problem 

Consider a weighted bipartite graph with m nodes in each part, G = (V,E,w), where 
V is the set of vertices, E is the set of edges and w : E ^ R is a set of real- valued 
weights associated to the edges. G can be simply represented by a matrix (i%) where 
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G x 



Wij = {xij,9} 



Figure 1 : Left: Illustration of an input vector-weighted bipartite graph G x with 3x3 
edges. There is a vector x e associated to each edge e (for clarity only is shown, 
corresponding to the solid edge). Right: weighted bipartite graph G obtained by eval- 
uating G x on the learned vector 6 (again only edge ij is shown). 



the entry is the weight of the edge ij. Consider also a bijection y : {1, 2, . . . , m} i— > 
{1,2,..., m}, i.e., a permutation. Then the matching problem consists of computing 



m 

y* = argmax ^ w iy{{) . (2) 

y i=i 

This is a well-studied problem; it is tractable and can be solved in 0(m 3 ) time [16, 26]. 
This model can be used to match features in images [ ], improve classification algo- 
rithms [ ] and rank webpages [ ], to cite a few applications. The typical setting 
consists of engineering the score matrix Wij according to domain knowledge and sub- 
sequently solving the combinatorial problem. 



3 The Model 

3.1 Basic Goal 

In this paper we assume that the weights are instead to be estimated from training 
data. More precisely, the weight wij associated to the edge ij in a graph will be the 
result of an appropriate composition of & feature vector (observed) and a parameter 
vector (estimated from training data). Therefore, in practice, our input is a vector- 
weighted bipartite graph G x = (V,E,x) (x : E i— > R n ), which is 'evaluated' at a 
particular (obtained from previous training) so as to attain the graph G — (V,E,w). 
See Figure 1 for an illustration. 

More formally, assume that a training set {X, Y} = {(x n , y n )}^=i is available, 
where x n := (x^, x^ • • • , x7 M( n )M{n))- ^ ere M(n) is the number of nodes in each 
part of the vector- weighted bipartite graph x n . We then parameterize x^ as w iy ^ = 
f{xi y (iy,Q), and the goal is to find the 6 which maximizes the posterior likelihood of 
the observed data. We will assume / to be bilinear, i.e., f(x iy ^;6) = (x iy ^yO). 
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3.2 Exponential Family Model 

We assume an exponential family model, where the probability model is 

p(y\x] 0) = exp ((00, y), 0) - g(x; 0)), where 



(3) 



g(x]0) = log J2 ex P (tfay), 0) (4) 
y 

is the log-partition function, which is a convex and differentiable function of [35]. 
The prediction in this model is the most likely y, i.e., 

y* = argmaxp(?/|x; 0) — argmax (4>(x, y), 0) (5) 
y y 

and ML estimation amounts to maximizing the conditional likelihood of the training set 
{X, Y}, i.e., computing argmax^ p(Y\X; 0). In practice we will in general introduce 
a prior on and perform MAP estimation: 

(9* = argmaxp(y |X; 0)p(0) = argmax p(<9|F, X). (6) 

9 

Assuming iid sampling, we havep(F|X; 0) = Yl^ =1 p(y n \x n ] 0). Therefore, 

N 

p(6\Y, X) cx p(0) exp ((<f>(x n , y n ), 0) - g(x n ; 0)) 

n=l 

= exp (^ogp(0) + £ (Mx n , V n ), 0) - 9(x n ; 0))j . (7) 

We impose a Gaussian prior on 0. Instead of maximizing the posterior we can in- 
stead minimize the negative log-posterior £(Y\X; 0), which becomes our loss function 
(we suppress the constant term): 

A 1 N 

£(Y\X; 0)=-\\0f + -^ ( 9 (x n ; 0) - y»), 0)) (8) 

n=l 

where A is a regularization constant. £(Y \X\ 0) is a convex function of since the log- 
partition function g(0) is a convex function of [35] and the other terms are clearly 
convex in 0. 

3.3 Feature Parameterization 

The critical observation now is that we equate the solution of the matching problem (2) 
to the prediction of the exponential family model (5), i.e., J2i w iy(i) ~ Z/)> 
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Since our goal is to parameterize features of individual pairs of nodes (so as to produce 
the weight of an edge), the most natural model is 

M 

<\>{x, y) = ^2 x Mi) ' which S ives ( 9 ) 

U>iy(i) = ( x iy(i),0) > ( 10 ) 

i.e., linear in both x and 6 (see Figure 1, right). The specific form for Xij will be dis- 
cussed in the experimental section. In light of (10), (2) now clearly means a prediction 
of the best match for G x under the model 6. 



4 Learning the Model 

4.1 Basics 

We need to solve 0* = argmin^ £(Y\X; 9). £(Y\X; 6) is a convex and differentiable 
function of 6 [35], therefore gradient descent will find the global optimum. In order to 
compute Vei(Y\X]0), we need to compute Vog{0). It is a standard result of expo- 
nential families that the gradient of the log-partition function is the expectation of the 
sufficient statistics: 

V e g(x;0) =E y ^ p{ylx . e) [(j)(x,y)]. (11) 

Therefore in order to perform gradient descent we need to compute the above expecta- 
tion. Opening the above expression gives 

E 3/~p(3/i*;0) y)} = ^ y)p(y\ x ' °) ( 12 ) 

y 

1 M 
= Z ( x . e ^ ^H x ,y)n e M(xiy(i),Q)), (i3) 

which reveals that the partition function Z(x\Q) needs to be computed. The partition 
function is: 

M 

Z(x; 0) = II exp((s iyW ,fl)) . (14) 

y i=l V ' 

— B iy(i) 

Note that the above is the expression for the permanent of matrix B [ ]. The per- 
manent is similar in definition to the determinant, the difference being that for the 
latter sgn(?/) comes before the product. However, unlike the determinant, which is 
computable efficiently and exactly by standard linear algebra manipulations [ ], com- 
puting the permanent is a t(P-complete problem [ ]. Therefore we have no realistic 
hope of computing (11) exactly for general problems. 
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4.2 Exact Expectation 



The exact partition function itself can be efficiently computed for up to about M = 30 
using the 0(M2 M ) algorithm by Ryser [29]. However for arbitrary expectations we 
are not aware of any exact algorithm which is more efficient than full enumeration 
(which would constrain tractability to very small graphs). However we will see that 
even in the case of very small graphs we find a very important application: learning 
to rank. In our experiments, we successfully apply a tractable instance of our model 
to benchmark webpage ranking datasets, obtaining very competitive results. For larger 
graphs, we have alternative options as indicated below. 

4.3 Approximate Expectation 

If we have a situation in which the set of feasible permutations is too large to be fully 
enumerated efficiently, we need to resort to some approximation for the expectation of 
the sufficient statistics. The best solution we are aware of is one by Huber and Law, who 
recently presented an algorithm to approximate the permanent of dense non-negative 
matrices [13]. The algorithm works by producing exact samples from the distribution 
of perfect matches on weighted bipartite graphs. This is in precisely the same form as 
the distribution we have here, p(y\x; 6) [13]. We can use this algorithm for applications 
that involve larger graphs. 1 We generate K samples from the distribution p(y\x; 6), and 
directly approximate (12) with a Monte Carlo estimate 



In our experiments, we apply this algorithm to an image matching application. 

5 Experiments 
5.1 Ranking 

Here we apply the general matching model introduced in previous sections to the task 
of learning to rank. Ranking is a fundamental problem with applications in diverse 
areas such as document retrieval, recommender systems, product rating and others. We 
focus on web page ranking. 

We are given a set of queries {qk} and, for each query q^, a list of D(k) documents 
{d\, . . . , d k D ^} with corresponding ratings {r\ , . . . , r k D ^} (assigned by a human ed- 
itor), measuring the relevance degree of each document with respect to query q^. A 
rating or relevance degree is usually a nominal value in the list {1, . . . , R}, where R is 
typically between 2 and 5. We are also given, for every retrieved document d\, a joint 
feature vector ipf for that document and the query q^ . 

Training At training time, we model each query q^ as a vector- weighted bipartite 
graph (Figure 1) where the nodes on one side correspond to a subset of cardinality M of 

lr The algorithm is described in the appendix. 
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all D(k) documents retrieved by the query, and the nodes on the other side correspond 
to all possible ranking positions for these documents (1, . . . , M). The subset itself is 
chosen randomly, provided at least one exemplar document of every rating is present. 
Therefore M must be such that M > R. 

The process is then repeated in a bootstrap manner: we resample (with replace- 
ment) from the set of documents {d\ , . . . , d k D ^}, M documents at a time (conditioned 
on the fact that at least one exemplar of every rating is present, but otherwise randomly). 
This effectively boosts the number of training examples since each query ends up 
being selected many times, each time with a different subset of M documents from the 
original set of D(k) documents. 

In the following we drop the query index k to examine a single query. Here we 
follow the construction used in [19] to map matching problems to ranking problems 
(indeed the only difference between our ranking model and that of [ ] is that they use 
a max-margin estimator and we use MAP in an exponential family.) Our edge feature 
vector Xij will be the product of the feature vector ipi associated with document i, 
and a scalar Cj (the choice of which will be explained below) associated with ranking 
position j 

Xij = TpiCj. (16) 

ipi is dataset specific (see details below). From (10) and (16), we have Wij = Cj 0), 
and training proceeds as explained in Section 4. 

Testing At test time, we are given a query q and its corresponding list of D associ- 
ated documents. We then have to solve the prediction problem, i.e., 

D D 

y* =argmax^(x i2/(i) ,6>) = argmax^ c yii) (^,0) . (17) 
y i=i y i=i 

We now notice that if the scalar Cj = c(j), where c is a non-increasing function of 
rank position j, then (17) can be solved simply by sorting the values of (ipi,0) in 
decreasing order. 2 In other words, the matching problem becomes one of ranking the 
values (ipi,6). Inference in our model is therefore very fast (linear time). 3 In this 
setting it makes sense to interpret the quantity (ipi : 0) as a score of document di for 
query q. This leaves open the question of which non-increasing function c should be 
used. We do not solve this problem in this paper, and instead choose a fixed c. In 
theory it is possible to optimize over c during learning, but in that case the optimization 
problem would no longer be convex. We describe the results of our method on LETOR 
2.0 [20], a publicly available benchmark data collection for comparing learning to rank 
algorithms. It is comprised of three data sets: OHSUMED, TD2003 and TD2004. 

Data sets OHSUMED contains features extracted from query-document pairs in 
the OHSUMED collection, a subset of MEDLINE, a database of medical publications. 
It contains 106 queries. For each query there are a number of associated documents, 
with relevance degrees judged by humans on three levels: definitely, possibly or not 

2 If r(v) denotes the vector of ranks of entries of vector v, then (a, 7r(6)) is maximized by the permutation 
7T* such that r(a) = r(7r* (6)), a theorem due to Polya, Littlewood, Hardy and Blackwell [30]. 
3 Sorting the top k items of a list of D items takes 0(k log k + D) time [21]. 
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Figure 2: Results of NDCG@k for state-of-the-art methods on TD2004 (left), TD2003 
(middle) and OHSUMED (right). This is best viewed in color. 



relevant. Each query-document pair is associated with a 25 dimensional feature vector, 
ipi. The total number of query-document pairs is 16,140. TD2003 and TD2004 contain 
features extracted from the topic distillation tasks of TREC 2003 and TREC 2004, with 
50 and 75 queries, respectively. Again, for each query there are a number of associated 
documents, with relevance degrees judged by humans, but in this case only two levels 
are provided: relevant or not relevant. Each query-document pair is associated with a 
44 dimensional feature vector, ^ . The total number of query-document pairs is 49,171 
for TD2003 and 74,170 for TD2004. All datasets are already partitioned for 5-fold 
cross-validation. See [20] for more details. 

Evaluation Metrics In order to measure the effectiveness of our method we use the 
normalized discount cumulative gain (NDCG) measure [ ] at rank position k, which 
is defined as 

i *L o r W) — 1 

NDCG@k = - ^ i n^v (18) 
Z^log(l+j) 

where r(j) is the relevance of the j th document in the list, and Z is a normalization 
constant so that a perfect ranking yields an NDCG score of 1. 

External Parameters The regularization constant A is chosen by 5 -fold cross- 
validation, with the partition provided by the LETOR package. All experiments are 
repeated 5 times to account for the randomness of the sampling of the training data. 
We use c(j) = M — j on all experiments. 

Optimization To optimize (8) we use a standard BFGS Quasi-Newton method with 
a backtracking line search, as described in [25]. 

Results For the first experiment training was done on subsets sampled as described 
above, where for each query we sampled OA- D(k) - M subsets, therefore increasing 
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the number of samples linearly with M. For TD2003 we also trained with all possible 
subsets (M = 2 (all) in the plots). In Figure 2 we plot the results of our method (named 
RankMatch), for M = R, compared to those achieved by a number of state-of-the-art 
methods which have published NDCG scores in at least two of the datasets: RankBoost 
[ ], RankS VM [ ], FRank [5], ListNet [ ], AdaRank [ ], QBRank [ ], IsoRank 
[ ], SortNet [ ], StructRank [ ] and C-CRF [ ]. We also included a plot of our 
implementation of DORM [19], using precisely the same resampling methodology and 
data for a fair comparison. RankMatch performs among the best methods on both 
TD2004 and OHSUMED, while on TD2003 it performs poorly (for low k) or fairly 
well (for high k). 

We notice that there are four methods which only report results in two of the three 
datasets: the two SortNet versions are only reported on TD2003 and TD2004, while 
StructRank and C-CRF are only reported on TD2004 and OHSUMED. RankMatch 
compares similarly with SortNet and StructRank on TD2004, similarly to C-CRF and 
StructRank on OHSUMED and similarly to the two versions of SortNet on TD2003. 
This exhausts all the comparisons against the methods which have results reported 
in only two datasets. A fairer comparison could be made if these methods had their 
performance published for the respective missing dataset. 

When compared to the methods which report results in all datasets, RankMatch 
entirely dominates their performance on TD2004 and is second only to IsoRank on 
OHSUMED (and performing similarly to QBRank). 

These results should be interpreted cautiously; [ ] presents an interesting discus- 
sion about issues with these datasets. Also, benchmarking of ranking algorithms is still 
in its infancy and we don't yet have publicly available code for all of the competitive 
methods. We expect this situation to change in the near future so that we are able to 
compare them on a fair and transparent basis. 

Consistency In a second experiment we trained RankMatch with different training 
subset sizes, starting with 0.03 • D{k) • M and going up to 1.0 • D{k) • M. Once again, 
we repeated the experiments with DORM using precisely the same training subsets. 
The purpose here is to see whether we observe a practical advantage of our method 
with increasing sample size, since statistical consistency only provides an asymptotic 
indication. The results are plotted in Figure 3 -right, where we can see that, as more 
training data is available, RankMatch improves more saliently than DORM. 

Runtime The runtime of our algorithm is competitive with that of max-margin 
for small graphs, such as those that arise from the ranking application. For larger 
graphs, the use of the sampling algorithm will result in much slower runtimes than 
those typically obtained in the max-margin framework. This is certainly the benefit of 
the max-margin matching formulations of [3, 19]: it is much faster for large graphs. 
Table 1 shows the runtimes for graphs of different sizes, both for exponential family 
and max-margin matching models. 
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Table 1 : Training times (per observation, in seconds) for the exponential model and 
max-margin. Runtimes for M = 3, 4, 5 are from the ranking experiments, computed 
by full enumeration; M — 20 corresponds to the image matching experiments, which 
use the sampler from [ ]. A problem of size 20 cannot be practically solved by full 
enumeration. 

M exponential model max margin 
0.0006661 0.0008965 

4 0.0011277 0.0016086 

5 0.0030187 0.0015328 
20 36.0300000 0.9334556 



5.2 Image Matching 

For our computer vision application we used a silhouette image from the Mythologi- 
cal Creatures 2D database 4 . We randomly selected 20 points on the silhouette as our 
interest points and applied shear to the image creating 200 different images. We then 
randomly selected N pairs of images for training, N for validation and 500 for testing, 
and trained our model to match the interest points in the pairs. In this setup, 

^•H^-^f, (19) 

where | • | denotes the elementwise difference and ipi is the Shape Context feature vector 
[ ] for point i. 

For a graph of this size computing the exact expectation is not feasible, so we 
used the sampling method described in Section 4.3. Once again, the regularization 
constant A was chosen by cross-validation. Given the fact that the MAP estimator 
is consistent while the max-margin estimator is not, one is tempted to investigate the 
practical performance of both estimators as the sample size grows. However, since 
consistency is only an asymptotic property, and also since the Hamming loss is not 
the criterion optimized by either estimator, this does not imply a better large-sample 
performance of MAP in real experiments. In any case, we present results with varying 
training set sizes in Figure 3 -left. The max-margin method is that of [ ]. After a 
sufficiently large training set size, our model seems to enjoy a slight advantage. 



6 Conclusion and Discussion 

We presented a method for learning max- weight bipartite matching predictors, and ap- 
plied it extensively to well-known webpage ranking datasets, obtaining state-of-the-art 
results. We also illustrated-with an image matching application-that larger problems 
can also be solved, albeit slowly, with a recently developed sampler. The method has a 
number of convenient features. First, it consists of performing maximum- a-posteriori 
estimation in an exponential family model, which results in a simple unconstrained 
convex optimization problem solvable by standard algorithms such as BFGS. Second, 

4 http://tosca.cs.technion.ac.il 
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Figure 3: Performance with increasing sample size. Left: hamming loss for different 
numbers of training pairs in the image matching problem (test set size fixed to 500 
pairs). Right: results of NDCG@1 on the ranking dataset OHSUMED. This evidence 
is in agreement with the fact that our estimator is consistent, while max-margin is not. 

the estimator is not only statistically consistent but also in practice it seems to benefit 
more from increasing sample sizes than its max-margin alternative. Finally, being fully 
probabilistic, the model can be easily integrated as a module in a Bayesian framework, 
for example. The main direction for future research consists of finding more efficient 
ways to solve large problems. This will most likely arise from appropriate exploitation 
of data sparsity in the permutation group. 
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Appendix A 

For completeness we include a description of the sampling algorithm presented in [13]. 
The algorithm is an accept-reject algorithm. The core idea of such an algorithm is very 
simple: assume we need to sample from a distribution p in a given domain M, but that 
such a task is intractable. Instead, we sample from a distribution q in a superset 3sf 
of the original domain (in which sampling is easier), whose restriction to the original 
domain coincides with the original distribution: q\^ = p. We then only 'accept' those 
samples that effectively fall within the original domain M. Clearly, the efficiency of 
such a procedure will be dictated by (i) how efficient it is to sample from q in N and 
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(ii) how much mass of q is in M. Roughly speaking, the algorithm presented in [13] 
manages to sample perfect matches of bipartite graphs such that both conditions (i) and 
(ii) are favorable. 

The reasoning goes as follows: the problem consists of generating variates y G 
y (y is a match) with the property that p(y) = w(y)/Z, where w(y) is the non- 
negative score of match y and Z = ^ y w(y) is the partition function, which in our 
case is a permanent as discussed in Section 4.1. We first partition the space y into 
Vi, . . . ,^1, where ^ = {2/ : y(l) = z}. Each part has its own partition function 

= . ^(2/)- Next, a suitable upper bound Utyi) > Zi on the partition function 

is constructed such that the following two properties hold: 5 



A I 



(pi) 52u(yi)<u(y). 



i=l 

(P2) If |^| = 1, then U%) = Z, = w(y). 

That is, (i) the upper bound is super-additive in the elements of the partition and (ii) 
if ]$i has a single match, the upper bound equals the partition function, which in this 
case is just the score of that match. 

Now the algorithm: consider the random variable 3 where p(3 = i) = U (tyi) /U (y ) . 
By (PI), Y^iLiP(i) ^ 1> so assume p(J = 0) = 1 — YliLiPif)- Now, draw a variate 
from this distribution, and if J = i = 0, reject and restart, otherwise recursively sample 
in y^. 6 This algorithm either stops and restarts or it reaches yfi na i which consists of a 
match, i.e., |yfi na i| = 1- This match is then a legitimate sample from p(y). The reason 
this is the case is because of (P2), as shown below. Assuming the algorithm finishes 
after k samples, the probability of the match is the telescopic product 



u()) m ) ^J(2)) u(Hk)) (pj) Hid 

and since the probability of acceptance is Z/U(^), we have 



(20) 



p{y) = zjum = ~z~' ( } 

which is indeed the distribution from which we want to sample. For pseudocode and a 
rigorous presentation of the algorithm, see [13]. 



5 See [ ] for details. 

6 Due to the self-reducibility of permutations, when we fix y(l) = i, what remains is also a set of 
permutations. We then sample y(2), y(3) . . . y(M). 
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